Automatic Thesaurus Generation from Raw Text using Knowledge-Poor Techniques
Abstract
In addition to showing how lexical units are related within a field, domain-specific thesauri give an idea of what subjects are important to that field and are thus useful at many points in an information system. The major impediment to the creation of thesauri has been the cost of building them manually. We present here a number of automatic techniques that jointly produce a first draft of a thesaurus from any domain-defining collection of text. The techniques are knowledge-poor in that no domain knowledge is required for their use. We have successfully applied these techniques to over twenty corpora ranging from 1 to 6 megabytes. Results from the thesaurus produced from a collection of medical abstracts will also be presented here.
Similar Resources
Using Hearst's Rules for the Automatic Acquisition of Hyponyms for Mining a Pharmaceutical Corpus
Fully Automatic Thesaurus Generation (ATG) seeks to generate useful thesauri by mining a corpus of raw text. A number of statistical approaches, based on term co-occurrence, exist for this, but in general they are only able to estimate the strength of the relationship between two terms, not its nature. In this paper we implement Hearst's method of discovering the hyponymy relations which are t...
Information Retrieval Tasks
Techniques of automatic natural language processing have been under development since the earliest computing machines, and in recent years these techniques have proven to be robust, reliable and efficient enough to lead to commercial products in many areas. The applications include machine translation, natural language interfaces and the stylistic analysis of texts, but NLP techniques have also ...
Evaluation Techniques For Automatic Semantic Extraction: Comparing Syntactic And Window Based Approaches
As large on-line corpora become more prevalent, a number of attempts have been made to automatically extract thesaurus-like relations directly from text using knowledge poor methods. In the absence of any specific application, comparing the results of these attempts is difficult. Here we propose an evaluation method using gold standards, i.e., pre-existing hand-compiled resources, as a means of...
Evaluation Techniques for Automatic Semantic Extraction: Comparing Syntactic and Window
As large on-line corpora become more prevalent, a number of attempts have been made to automatically extract thesaurus-like relations directly from text using knowledge poor methods. In the absence of any specific application, comparing the results of these attempts is difficult. Here we propose an evaluation method using gold standards, i.e., pre-existing hand-compiled resources, as a means of c...
A Method for Refining Automatically-Discovered Lexical Relations: Combining Weak Techniques for Stronger Results
Knowledge-poor corpus-based approaches to natural language processing are attractive in that they do not incur the difficulties associated with complex knowledge bases and real-world inferences. However, these kinds of language processing techniques in isolation often do not suffice for a particular task; for this reason we are interested in finding ways to combine various techniques and improve thei...